Preparing text to use with TensorFlow models

The high-level steps to prepare text for use in a machine learning model are:

  1. Tokenize the words to get numerical values for them
  2. Create numerical sequences of the sentences
  3. Adjust the sequences so they are all the same length

In this colab, you learn how to use padding to make the sequences all be the same length.
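Put together, the three steps correspond to three Keras calls. Here is a compact preview of the whole pipeline, a minimal sketch using the same classes imported below and a couple of hypothetical example sentences (the variable names here are just for illustration; the rest of the colab walks through each step in detail):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

preview_sentences = ['hello world', 'hello there world']

# 1. Tokenize: build a word -> number mapping from the text
preview_tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
preview_tokenizer.fit_on_texts(preview_sentences)

# 2. Sequence: replace each word with its number from the word index
preview_sequences = preview_tokenizer.texts_to_sequences(preview_sentences)

# 3. Pad: make every sequence the same length
preview_padded = pad_sequences(preview_sequences, maxlen=5)
print(preview_padded)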

Import the classes you need

In [1]:
# Import Tokenizer and pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Write some sentences

Feel free to write your own sentences here.

In [2]:
sentences = [
    'My favorite food is ice cream',
    'do you like ice cream too?',
    'My dog likes ice cream!',
    "your favorite flavor of icecream is chocolate",
    "chocolate isn't good for dogs",
    "your dog, your cat, and your parrot prefer broccoli"
]
print(sentences)
['My favorite food is ice cream', 'do you like ice cream too?', 'My dog likes ice cream!', 'your favorite flavor of icecream is chocolate', "chocolate isn't good for dogs", 'your dog, your cat, and your parrot prefer broccoli']

Create the Tokenizer and define an out of vocabulary token

When creating the Tokenizer, you can specify the max number of words in the dictionary. You can also specify a token to represent words that are out of the vocabulary (OOV), in other words, that are not in the dictionary. This OOV token will be used when you create sequences for sentences that contain words that are not in the word index.

In [3]:
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")

Tokenize the words

In [4]:
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
{'<OOV>': 1, 'your': 2, 'ice': 3, 'cream': 4, 'my': 5, 'favorite': 6, 'is': 7, 'dog': 8, 'chocolate': 9, 'food': 10, 'do': 11, 'you': 12, 'like': 13, 'too': 14, 'likes': 15, 'flavor': 16, 'of': 17, 'icecream': 18, "isn't": 19, 'good': 20, 'for': 21, 'dogs': 22, 'cat': 23, 'and': 24, 'parrot': 25, 'prefer': 26, 'broccoli': 27}

Turn sentences into sequences

Each word now has a unique number in the word index. However, words in a sentence are in a specific order. You can't just randomly mix up words and have the outcome be a sentence.

For example, although "chocolate isn't good for dogs" is a perfectly fine sentence, "dogs isn't for chocolate good" does not make sense as a sentence.

So the next step in representing text in a way that machine learning programs can meaningfully use is to create numerical sequences that represent the sentences in the text.

Each sentence will be converted into a sequence where each word is replaced by its number in the word index.

In [5]:
sequences = tokenizer.texts_to_sequences(sentences)
print (sequences)
[[5, 6, 10, 7, 3, 4], [11, 12, 13, 3, 4, 14], [5, 8, 15, 3, 4], [2, 6, 16, 17, 18, 7, 9], [9, 19, 20, 21, 22], [2, 8, 2, 23, 24, 2, 25, 26, 27]]
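One subtlety worth noting: the num_words limit set when creating the Tokenizer only takes effect at this step. The word index always contains every word seen by fit_on_texts, but texts_to_sequences keeps only words whose index is below num_words and maps everything else to the OOV token. A minimal sketch, using a deliberately tiny num_words value chosen just for illustration:

# With num_words=5, only indices below 5 survive in the sequences
small_tokenizer = Tokenizer(num_words=5, oov_token="<OOV>")
small_tokenizer.fit_on_texts(sentences)

# The word index still lists every word seen during fitting...
print(len(small_tokenizer.word_index))

# ...but in the sequences, every word whose index is 5 or higher
# is replaced by the OOV token's index (1)
print(small_tokenizer.texts_to_sequences(sentences))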

Make the sequences all the same length

Later, when you feed the sequences into a neural network to train a model, the sequences all need to be uniform in size. Currently the sequences have varied lengths, so the next step is to make them all the same size, by padding them with zeros and/or truncating them.

Use tf.keras.preprocessing.sequence.pad_sequences to add zeros to the sequences to make them all the same length. By default, the padding goes at the start of the sequences, but you can specify padding at the end instead.

You can optionally specify the maximum length to pad the sequences to. Sequences that are longer than the specified max length will be truncated. By default, values are removed from the beginning of the sequence, but you can specify truncation from the end instead.

If you don't provide the max length, then the sequences are padded to match the length of the longest sentence.

For all the options when padding and truncating sequences, see https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences

In [6]:
padded = pad_sequences(sequences)
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)
print("\nPadded Sequences:")
print(padded)
Word Index =  {'<OOV>': 1, 'your': 2, 'ice': 3, 'cream': 4, 'my': 5, 'favorite': 6, 'is': 7, 'dog': 8, 'chocolate': 9, 'food': 10, 'do': 11, 'you': 12, 'like': 13, 'too': 14, 'likes': 15, 'flavor': 16, 'of': 17, 'icecream': 18, "isn't": 19, 'good': 20, 'for': 21, 'dogs': 22, 'cat': 23, 'and': 24, 'parrot': 25, 'prefer': 26, 'broccoli': 27}

Sequences =  [[5, 6, 10, 7, 3, 4], [11, 12, 13, 3, 4, 14], [5, 8, 15, 3, 4], [2, 6, 16, 17, 18, 7, 9], [9, 19, 20, 21, 22], [2, 8, 2, 23, 24, 2, 25, 26, 27]]

Padded Sequences:
[[ 0  0  0  5  6 10  7  3  4]
 [ 0  0  0 11 12 13  3  4 14]
 [ 0  0  0  0  5  8 15  3  4]
 [ 0  0  2  6 16 17 18  7  9]
 [ 0  0  0  0  9 19 20 21 22]
 [ 2  8  2 23 24  2 25 26 27]]
In [7]:
# Specify a max length for the padded sequences
padded = pad_sequences(sequences, maxlen=15)
print(padded)
[[ 0  0  0  0  0  0  0  0  0  5  6 10  7  3  4]
 [ 0  0  0  0  0  0  0  0  0 11 12 13  3  4 14]
 [ 0  0  0  0  0  0  0  0  0  0  5  8 15  3  4]
 [ 0  0  0  0  0  0  0  0  2  6 16 17 18  7  9]
 [ 0  0  0  0  0  0  0  0  0  0  9 19 20 21 22]
 [ 0  0  0  0  0  0  2  8  2 23 24  2 25 26 27]]
In [8]:
# Put the padding at the end of the sequences
padded = pad_sequences(sequences, maxlen=15, padding="post")
print(padded)
[[ 5  6 10  7  3  4  0  0  0  0  0  0  0  0  0]
 [11 12 13  3  4 14  0  0  0  0  0  0  0  0  0]
 [ 5  8 15  3  4  0  0  0  0  0  0  0  0  0  0]
 [ 2  6 16 17 18  7  9  0  0  0  0  0  0  0  0]
 [ 9 19 20 21 22  0  0  0  0  0  0  0  0  0  0]
 [ 2  8  2 23 24  2 25 26 27  0  0  0  0  0  0]]
In [9]:
# Limit the length of the sequences; you will see some sequences get truncated
padded = pad_sequences(sequences, maxlen=3)
print(padded)
[[ 7  3  4]
 [ 3  4 14]
 [15  3  4]
 [18  7  9]
 [20 21 22]
 [25 26 27]]
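As noted above, truncation removes values from the start of a sequence by default. To drop values from the end instead, pass truncating="post". A minimal sketch:

# Truncate from the end of each sequence instead of the beginning
padded = pad_sequences(sequences, maxlen=3, truncating="post")
print(padded)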

What happens if some of the sentences contain words that are not in the word index?

Here's where the "out of vocabulary" token is used. Try generating sequences for some sentences that have words that are not in the word index.

In [10]:
# Try turning sentences that contain words that 
# aren't in the word index into sequences.
# Add your own sentences to the test_data
test_data = [
    "my best friend's favorite ice cream flavor is strawberry",
    "my dog's best friend is a manatee"
]
print (test_data)

# Remind ourselves which number corresponds to the
# out of vocabulary token in the word index
print("<OOV> has the number", word_index['<OOV>'], "in the word index.")

# Convert the test sentences to sequences
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

# Pad the new sequences
padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")

# Notice that "1" appears in the sequence wherever there's a word 
# that's not in the word index
print(padded)
["my best friend's favorite ice cream flavor is strawberry", "my dog's best friend is a manatee"]
<OOV> has the number 1 in the word index.

Test Sequence =  [[5, 1, 1, 6, 3, 4, 16, 7, 1], [5, 1, 1, 1, 7, 1, 1]]

Padded Test Sequence: 
[[ 0  5  1  1  6  3  4 16  7  1]
 [ 0  0  0  5  1  1  1  7  1  1]]
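For contrast, if the Tokenizer had been created without an oov_token, unknown words would simply be skipped rather than mapped to 1, so the resulting sequences would be shorter than the sentences they came from. A minimal sketch of that behavior:

# Without an OOV token, words that aren't in the word index are dropped
no_oov_tokenizer = Tokenizer(num_words=100)
no_oov_tokenizer.fit_on_texts(sentences)
print(no_oov_tokenizer.texts_to_sequences(test_data))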